Fitting an unconditional model (without predictors) Robust estimates – Gaussian vs. t-distribution
How do we fit an unconditional model (without predictors)? From here, we can see the philosophy of the Bayesian perspective.
Scale y | Dist. \(y \sim f(\mu, ...)\) | Inverse link \(\mu = g^{-1}(\eta)\) |
---|---|---|
Metric | \(y \sim dnorm(\mu,\sigma)\) | \(\mu = \eta\) |
Binary | \(y \sim dbern(\mu)\) | \(\mu = logistic(\eta)\) |
Nominal | \(y \sim dmultinom(\mu_1, ...,\mu_p)\) | \(\mu_k = \frac{exp(\eta_k)}{\sum_j exp(\eta_j)}\) |
Ordinal | \(y \sim dmultinom(\mu_1, ...,\mu_p)\) | \(\mu = \Phi(\frac{\theta_k - \eta}{\sigma}) - \Phi(\frac{\theta_{k-1} - \eta}{\sigma})\) |
Count | \(y \sim dpois(\mu)\) | \(\mu = exp(\eta)\) |
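For intuition, these inverse links are easy to write directly in R; the helper names below are purely illustrative (not from any package):

# inverse links from the table above (illustrative helper functions)
inv_identity <- function(eta) eta                     # metric scale
inv_logistic <- function(eta) 1/(1 + exp(-eta))       # binary scale
inv_softmax  <- function(eta) exp(eta)/sum(exp(eta))  # nominal scale
inv_log      <- function(eta) exp(eta)                # count scale
inv_logistic(0)          # 0.5
inv_softmax(c(1, 2, 3))  # probabilities summing to 1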
We just need to estimate both parameters of the normal distribution for the random variable \(y\).
# preparing data
set.seed(03172021)
y <- rnorm(100, mean = 6, sd = 3)
Ntotal = length(y)
Create a data list and include the sample mean and standard deviation in it.
dataList = list(
  y = y,
  Ntotal = Ntotal,
  mean_mu = mean(y),
  sd_mu = sd(y)
)
Specify the model.
Recall that in JAGS [@RN733] the normal distribution is specified by the precision \(\frac{1}{\sigma^2}\) instead of the standard deviation or variance. Select an uninformative prior for \(\sigma\) and a normal (conjugate) prior for \(\mu\).
What do we suggest as parameters of the normal distribution based on the sample? \[ y_i \sim N(\mu,\tau), \ \text{where} \ \tau = \frac{1}{\sigma^2} \\ \mu \sim N(M,S) \\ \sigma \sim \text{Uniform}[Lo, Hi] \ \text{or} \ \text{Gamma}(\alpha, \beta) \]
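To see why the normal prior on \(\mu\) is called conjugate, note that with \(\sigma\) treated as known the posterior of \(\mu\) is again normal and available in closed form; a minimal sketch with made-up numbers:

# conjugate normal update for mu when sigma is treated as known (illustrative numbers)
M <- 6; S <- 3             # prior mean and sd of mu (assumed)
sigmaKnown <- 3; n <- 100  # "known" sd and sample size (assumed)
ybar <- 5.92               # sample mean (assumed)
postPrec <- 1/S^2 + n/sigmaKnown^2                    # precisions add
postMean <- (M/S^2 + n*ybar/sigmaKnown^2) / postPrec  # precision-weighted average
c(mean = postMean, sd = sqrt(1/postPrec))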
modelString = "
model {
  for (i in 1:Ntotal) {
    y[i] ~ dnorm(mu, 1/sigma^2)  # JAGS uses precision
  }
  # mu ~ dnorm(mean_mu, 1/(100*sd_mu)^2)
  mu ~ dnorm(mean_mu, Ntotal/sd_mu^2)  # JAGS uses precision
  sigma ~ dunif(sd_mu/1000, sd_mu*1000)
}
" # close quote for modelString
# Write out modelString to a text file
writeLines(modelString, con="data/TEMPmodel.txt")
Initialize chains.
initsList <- function(){
  upDown <- sample(c(1,-1), 1)
  m <- mean(y)*(1 + upDown*.05)
  s <- sd(y)*(1 - upDown*.1)
  list(mu = m, sigma = s)
}
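Calling the function a few times shows how each chain receives slightly different, jittered starting values around the sample estimates:

# each call randomly shifts the starting values up or down by a few percent
replicate(3, initsList(), simplify = FALSE)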
Run the chains.
# Create, initialize, and adapt the model:
parameters = c("mu", "sigma")  # The parameters to be monitored
adaptSteps = 500               # Number of steps to "tune" the samplers
burnInSteps = 1000
numSavedSteps = 50000
nChains = 4
thinSteps = 1
nIter = ceiling((numSavedSteps*thinSteps)/nChains)
jagsModel = jags.model("data/TEMPmodel.txt", data=dataList, inits=initsList,
                       n.chains=nChains, n.adapt=adaptSteps)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 100
Unobserved stochastic nodes: 2
Total graph size: 114
Initializing model
# Burn-in:
update(jagsModel, n.iter=burnInSteps)
# Run it
# The saved MCMC chain:
codaSamples = coda.samples(jagsModel, variable.names=parameters,
                           n.iter=nIter, thin=thinSteps)
Check how the chains converged.
summary(codaSamples)
Iterations = 1501:14000
Thinning interval = 1
Number of chains = 4
Sample size per chain = 12500
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 5.923 0.2265 0.001013 0.001001
sigma 3.208 0.2325 0.001040 0.001407
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 5.480 5.769 5.922 6.076 6.366
sigma 2.792 3.044 3.194 3.356 3.698
The parameter estimates are close to the values we simulated and very similar to what point estimation would give.
mean(y)
[1] 5.922201
sd(y)
[1] 3.173288
Plot the sample traces and the posterior densities of the parameters.
plot(codaSamples)
Plot autocorrelations.
autocorr.plot(codaSamples, ask = F)
The autocorrelation function plot suggests that the effective sample size for the standard deviation must be noticeably smaller than for the mean.
In fact:
effectiveSize(codaSamples)
mu sigma
51236.85 27414.72
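The effective size discounts the raw number of draws by the chain autocorrelation, roughly \(N/(1 + 2\sum_k \rho_k)\); a crude manual check for \(\sigma\) in the first chain (truncating at the positive autocorrelations) is sketched below:

# rough effective size for sigma, chain 1: N / (1 + 2 * sum of positive autocorrelations)
sigma1 <- as.matrix(codaSamples[[1]])[, "sigma"]
rho <- acf(sigma1, lag.max = 50, plot = FALSE)$acf[-1]  # drop lag 0
length(sigma1) / (1 + 2*sum(rho[rho > 0]))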
The shrink factor shows that, despite the longer memory of the standard deviation chain, the distributions converged:
gelman.diag(codaSamples)
Potential scale reduction factors:
Point est. Upper C.I.
mu 1 1
sigma 1 1
Multivariate psrf
1
gelman.plot(codaSamples)
Observed HDIs of the chains:
lapply(codaSamples,function(z) hdi(as.matrix(z)))
[[1]]
mu sigma
lower 5.479068 2.774907
upper 6.362111 3.677393
attr(,"credMass")
[1] 0.95
[[2]]
mu sigma
lower 5.488275 2.766543
upper 6.367365 3.661088
attr(,"credMass")
[1] 0.95
[[3]]
mu sigma
lower 5.475716 2.778049
upper 6.353808 3.677524
attr(,"credMass")
[1] 0.95
[[4]]
mu sigma
lower 5.489742 2.782062
upper 6.386876 3.677348
attr(,"credMass")
[1] 0.95
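The HDI is simply the shortest interval containing 95% of the posterior draws; a minimal sketch of that computation for \(\mu\) in the first chain (essentially what the `hdi()` call above does):

# shortest interval containing 95% of the draws for mu, chain 1 (sketch)
mu1 <- sort(as.matrix(codaSamples[[1]])[, "mu"])
n95 <- ceiling(0.95 * length(mu1))             # number of draws inside the interval
starts <- 1:(length(mu1) - n95 + 1)            # candidate left endpoints
widths <- mu1[starts + n95 - 1] - mu1[starts]  # width of each candidate interval
i <- which.min(widths)                         # narrowest one
c(lower = mu1[i], upper = mu1[i + n95 - 1])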
Now specify the equivalent model in Stan.
modelString = "
data {
int<lower=1> Ntotal;
real y[Ntotal];
real mean_mu;
real sd_mu;
}
transformed data {
real unifLo;
real unifHi;
real normalSigma;
unifLo = sd_mu/100;
unifHi = sd_mu*100;
normalSigma = sd_mu*100; // vague prior: 100 times the sample sd
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
sigma ~ uniform(unifLo, unifHi);
mu ~ normal(mean_mu, normalSigma);
y ~ normal(mu, sigma);
}
" # close quote for modelString
Create a DSO and save it to disk to reuse later or just keep it in memory.
stanDso <- stan_model(model_code = modelString)
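To reuse the compiled model across R sessions, one option is base R serialization; the file name below is just an example:

# optionally persist the compiled model for later sessions (file name is illustrative)
saveRDS(stanDso, file = "data/stanDso_normal.rds")
# later: stanDso <- readRDS("data/stanDso_normal.rds")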
Run chains by either using the existing DSO:
stanFit <- sampling(object = stanDso,
                    data = dataList,
                    pars = c('mu', 'sigma'),
                    chains = 2,
                    cores = 2,
                    iter = 5000,
                    warmup = 200,
                    thin = 1)
Or, alternatively, use the description of the model saved in `ch16_1.stan` directly:
# fit model
fit <- stan(file = "modelString.stan",
            data = list(Ntotal = length(y),
                        y = y,
                        meanY = mean(y),
                        sdY = sd(y)),
            pars = c('mu', 'sigma'),
            # control = list(adapt_delta=0.99),
            iter = 5000, chains = 2, cores = 2)
Objects `fit` and `stanFit` should return very similar results. The difference between `stan()` and `sampling()` is the argument `object`, which takes the compiled DSO. If you expect to repeat the same calculations with different data, compiling the DSO once and reusing it with `sampling()` is faster.
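For example, the same DSO can be sampled again with a different data set without recompiling; the second sample below is made up purely for illustration:

# reuse the compiled DSO on a new data set -- no recompilation needed (illustration)
yNew <- rnorm(200, mean = 2, sd = 1)
stanFitNew <- sampling(object = stanDso,
                       data = list(y = yNew, Ntotal = length(yNew),
                                   mean_mu = mean(yNew), sd_mu = sd(yNew)),
                       pars = c('mu', 'sigma'),
                       chains = 2, cores = 2, iter = 5000, warmup = 200, thin = 1)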
# text statistics:
print (stanFit)
Inference for Stan model: anon_model.
2 chains, each with iter=5000; warmup=200; thin=1;
post-warmup draws per chain=4800, total post-warmup draws=9600.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
mu 5.92 0.00 0.32 5.30 5.71 5.92 6.14 6.57
sigma 3.22 0.00 0.23 2.80 3.06 3.20 3.36 3.72
lp__ -164.84 0.02 1.01 -167.54 -165.23 -164.53 -164.11 -163.85
n_eff Rhat
mu 8953 1
sigma 7630 1
lp__ 4176 1
Samples were drawn using NUTS(diag_e) at Thu Feb 23 20:07:56 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
# estimates & hdi:
plot(stanFit)
# samples
traceplot(stanFit, ncol=1, inc_warmup=F)
pairs(stanFit, pars=c('mu','sigma'))
stan_scat(stanFit, c('mu', 'sigma'))
stan_hist(stanFit)
stan_dens(stanFit)
# autocorrelation:
stan_ac(stanFit, separate_chains = T)
# or work with the familiar coda class:
stan2coda <- function(fit) {
  # apply to all chains
  mcmc.list(lapply(1:ncol(fit), function(x) mcmc(as.array(fit)[,x,])))
}
codaSamples <- stan2coda(stanFit)
summary(codaSamples)
Iterations = 1:4800
Thinning interval = 1
Number of chains = 2
Sample size per chain = 4800
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu 5.923 0.3225 0.003291 0.003330
sigma 3.216 0.2344 0.002392 0.002609
lp__ -164.839 1.0128 0.010337 0.015893
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu 5.298 5.707 5.922 6.137 6.566
sigma 2.795 3.055 3.201 3.364 3.720
lp__ -167.540 -165.228 -164.526 -164.113 -163.847
plot(codaSamples)
autocorr.plot(codaSamples)
effectiveSize(codaSamples)
mu sigma lp__
9392.909 8296.156 4092.009
gelman.diag(codaSamples)
Potential scale reduction factors:
Point est. Upper C.I.
mu 1 1.00
sigma 1 1.00
lp__ 1 1.01
Multivariate psrf
1
gelman.plot(codaSamples)
plot(density(codaSamples[[1]][,1]),xlim=c(0,8),ylim=c(0,3)) # mu, 1st chain
lines(density(codaSamples[[1]][,2])) # sigma, 1st chain
lines(density(codaSamples[[2]][,1]),col="red") # mu, 2nd chain
lines(density(codaSamples[[2]][,2]),col="red") # sigma, 2nd chain
Or you can use shinystan to do a similar analysis of the fitted model:
launch_shinystan(fit)
Create a sample with heavy tails in order to check robust estimation of the parameters of the normal distribution.
Refer to this simulation of leptokurtic distributions.
nSample <- 1000
sd.Values <- c(2, 3.4, .8, 2.6)
sd.process <- rep(c(rep(sd.Values[1], 50),
                    rep(sd.Values[2], 75),
                    rep(sd.Values[3], 75),
                    rep(sd.Values[4], 50)), 4)
plot(sd.process,type="l")
The variable `sd.process` is a deterministically changing standard deviation. Simulating perfectly normal and independent realizations with such different standard deviations produces a leptokurtic distribution.
set.seed(05062022)
y <- rnorm(nSample)*sd.process
y <- y[1:300]
plot(y, type="l")
den <- density(y)
plot(den)
lines(den$x, dnorm(den$x, mean(y), sd(y)), col="red")
Ntotal = length(y)
The density plot clearly shows fat tails; under the assumption of a normal distribution these observations would most likely be flagged as outliers.
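A quick numeric check of the fat tails is the sample excess kurtosis, which is 0 for a Gaussian and clearly positive here; a sketch without extra packages:

# sample excess kurtosis: 0 for a Gaussian, > 0 for fat tails (rough check)
z <- (y - mean(y))/sd(y)
mean(z^4) - 3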
Create data list.
dataList = list(
  y = y,
  Ntotal = Ntotal,
  mean_mu = mean(y),
  sd_mu = sd(y)
)
The Student \(t\)-distribution with \(\nu\) degrees of freedom appears in a sampling scheme from a Gaussian distribution with unknown mean and variance.
In such a case the variable \(X \sim N(\mu, \sigma)\) has the standardized score \[z = \frac{X - \hat{\mu}}{s/\sqrt{n}} = \frac{X - \hat{\mu}}{\hat{\sigma}},\] where \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \hat{\mu})^2\), and this score has a \(t\)-distribution with \(n-1\) degrees of freedom.
Then the random variable \[ X = \hat{\mu} + \hat{\sigma}z\] has a generalized Student distribution with location parameter \(\hat{\mu}\), scale parameter \(\hat{\sigma}\), and \(\nu\) degrees of freedom.
The parameters \(\hat{\mu}\) and \(\hat{\sigma}\) do not correspond exactly to the parameters of the Gaussian distribution, but \(\hat{\mu} \rightarrow \mu\) and \(\hat{\sigma} \rightarrow \sigma\) as \(\nu\) becomes large and the \(t\)-distribution approaches the Gaussian.
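To see what the extra parameter buys us, compare the standard normal density with a \(t\)-density with few degrees of freedom; the heavier tails are what make the estimate robust (\(\nu = 3\) below is chosen only for illustration):

# compare N(0,1) with a t-distribution with few degrees of freedom (nu = 3 is illustrative)
x <- seq(-6, 6, length.out = 400)
plot(x, dnorm(x), type = "l", ylab = "density")
lines(x, dt(x, df = 3), col = "red")
legend("topright", legend = c("N(0,1)", "t, nu = 3"), col = c("black", "red"), lty = 1)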
With this interpretation of the parameters, specify the JAGS model with a \(t\)-distribution likelihood:
modelString = "
model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dt(mu, 1/sigma^2, nu)
  }
  mu ~ dnorm( mean_mu , 100/sd_mu^2 )
  sigma ~ dunif( sd_mu/1000 , sd_mu*1000 )
  nu ~ dexp(1/30.0)
}
" # close quote for modelString
# Write out modelString to a text file
writeLines(modelString , con="data/TEMPmodel-jags.txt")
Initialize the model with MLE.
initsList <- function() {
  upDown <- sample(c(1,-1), 1)
  m <- mean(y)*(1 + upDown*.05)
  s <- sd(y)*(1 - upDown*.1)
  list(mu = m, sigma = s, nu = 2)
}
parameters = c("mu", "sigma", "nu")  # The parameters to be monitored
adaptSteps = 500                     # Number of steps to "tune" the samplers
burnInSteps = 1000
nChains = 3
thinSteps = 1
numSavedSteps = 50000
(nIter = ceiling((numSavedSteps*thinSteps)/nChains))
[1] 16667
# Create, initialize, and adapt the model:
jagsModel = jags.model("data/TEMPmodel-jags.txt", data=dataList, inits=initsList,
                       n.chains=nChains, n.adapt=adaptSteps)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 300
Unobserved stochastic nodes: 3
Total graph size: 318
Initializing model
Run the chains.
# Burn-in:
update(jagsModel, n.iter=burnInSteps)
# The saved MCMC chain:
codaSamples = coda.samples(jagsModel, variable.names=parameters,
                           n.iter=nIter, thin=thinSteps)
Explore the results.
summary(codaSamples)
Iterations = 1501:18167
Thinning interval = 1
Number of chains = 3
Sample size per chain = 16667
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
mu -0.2816 0.1115 0.0004988 0.0006814
nu 4.1298 1.5167 0.0067826 0.0223954
sigma 1.8116 0.1673 0.0007482 0.0021333
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
mu -0.5031 -0.3565 -0.2804 -0.2055 -0.06739
nu 2.3196 3.1388 3.7905 4.7075 7.98449
sigma 1.5018 1.6962 1.8053 1.9221 2.15579
mean(y)
[1] -0.4222642
sd(y)
[1] 2.495025
Note that the robust estimate of \(\mu\) is similar to the sample mean, but the robust estimate of \(\sigma\) is significantly smaller than the sample standard deviation.
plot(codaSamples)
summary(sd.process)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.80 0.80 2.30 2.18 3.40 3.40
autocorr.plot(codaSamples,ask=F)
effectiveSize(codaSamples)
mu nu sigma
26967.896 4580.058 6146.000
gelman.diag(codaSamples)
Potential scale reduction factors:
Point est. Upper C.I.
mu 1 1.00
nu 1 1.01
sigma 1 1.00
Multivariate psrf
1
gelman.plot(codaSamples)
head(codaSamples[1])
[[1]]
Markov Chain Monte Carlo (MCMC) output:
Start = 1501
End = 1507
Thinning interval = 1
mu nu sigma
[1,] -0.35538534 4.495681 1.958775
[2,] -0.06611877 4.258681 1.793895
[3,] -0.05521260 4.103952 1.789491
[4,] -0.24838098 5.631748 1.801702
[5,] -0.24537268 4.109628 1.863111
[6,] -0.19055382 3.354896 2.028280
[7,] -0.56653077 4.627866 1.839904
attr(,"class")
[1] "mcmc.list"
(HDIofChains <- lapply(codaSamples, function(z) hdi(as.matrix(z))))
[[1]]
mu nu sigma
lower -0.5047759 2.070184 1.501770
upper -0.0678133 7.153532 2.149501
attr(,"credMass")
[1] 0.95
[[2]]
mu nu sigma
lower -0.49556250 2.028008 1.475841
upper -0.07011458 6.820492 2.127922
attr(,"credMass")
[1] 0.95
[[3]]
mu nu sigma
lower -0.50150501 1.966809 1.497709
upper -0.06232077 6.961558 2.143415
attr(,"credMass")
[1] 0.95
The non-robust estimate of \(\sigma\) is outside the HDI for all of the chains.
Adapt the Stan model description to the \(t\)-distribution with the additional parameter \(\nu\).
modelString = "
data {
int<lower=1> Ntotal;
real y[Ntotal];
real mean_mu;
real sd_mu;
}
transformed data {
real unifLo;
real unifHi;
real normalSigma;
real expLambda; //New: parameter of prior for nu
unifLo = sd_mu/1000;
unifHi = sd_mu*1000;
normalSigma = sd_mu*100;
expLambda=1/29.0; //New: setting value for expLambda
}
parameters {
real<lower=0> nuMinusOne; //New: definition of additional parameter nu
real mu;
real<lower=0> sigma;
}
transformed parameters {
real<lower=0> nu; //New: new parameter nu
nu=nuMinusOne+1; //New: shifting nu to avoid zero
}
model {
sigma ~ uniform(unifLo, unifHi);
mu ~ normal(mean_mu, normalSigma);
nuMinusOne~exponential(expLambda); //New: exponential prior for nu
y ~ student_t(nu, mu, sigma); //New: student_t distribution for nu
}
" # close quote for modelString
Create DSO.
stanDso <- stan_model(model_code = modelString)
Run MCMC.
stanFitRobust <- sampling(object = stanDso,
                          data = dataList,
                          pars = c('nu', 'mu', 'sigma'),
                          chains = 3,
                          cores = 3,
                          iter = 50000,
                          warmup = 300,
                          thin = 1)
Explore the results.
# text statistics:
print(stanFitRobust)
Inference for Stan model: anon_model.
3 chains, each with iter=50000; warmup=300; thin=1;
post-warmup draws per chain=49700, total post-warmup draws=149100.
mean se_mean sd 2.5% 25% 50% 75% 97.5%
nu 4.04 0.01 1.54 2.30 3.10 3.72 4.58 7.69
mu -0.24 0.00 0.12 -0.49 -0.33 -0.24 -0.16 0.00
sigma 1.80 0.00 0.16 1.50 1.69 1.80 1.91 2.14
lp__ -514.38 0.01 1.26 -517.67 -514.95 -514.05 -513.46 -512.95
n_eff Rhat
nu 45883 1
mu 87613 1
sigma 62456 1
lp__ 54355 1
Samples were drawn using NUTS(diag_e) at Thu Feb 23 20:11:46 2023.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
# estimates & hdi:
plot(stanFitRobust)
# samples
class(stanFitRobust)
[1] "stanfit"
attr(,"package")
[1] "rstan"
rstan::traceplot(stanFitRobust, ncol=1, inc_warmup=F)
pairs(stanFitRobust, pars=c('nu','mu','sigma'))
stan_scat(stanFitRobust, c('nu','mu'))
stan_scat(stanFitRobust, c('nu','sigma'))
stan_scat(stanFitRobust, c('mu','sigma'))
stan_hist(stanFitRobust)
stan_dens(stanFitRobust)
# autocorrelation:
stan_ac(stanFitRobust, separate_chains = T)
stan_diag(stanFitRobust,information = "sample")
stan_diag(stanFitRobust,information = "stepsize",chain = 0)
stan_diag(stanFitRobust,information = "treedepth",chain = 0)
stan_diag(stanFitRobust,information = "divergence",chain = 0)
There seems to be a pattern in the relationship between \(\sigma\) and \(\nu\).
Check whether there is a dependency.
stanRobustChains <- extract(stanFitRobust)
names(stanRobustChains)
[1] "nu" "mu" "sigma" "lp__"
plot(stanRobustChains$nu,stanRobustChains$sigma)
plot(rank(stanRobustChains$nu),rank(stanRobustChains$sigma))
Interpret the dependency: the degrees of freedom \(\nu\) are related to the sample size \((n-1)\). As the degrees of freedom increase (i.e., the implied sample size grows), the \(t\)-distribution has thinner tails, pushing the critical values towards the mean.
\(\Rightarrow\) this suggests a \(t\)-copula (a guess based on the empirical copula).
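One simple way to quantify the monotone dependence suggested by the rank plot is Spearman's rank correlation between the posterior draws of \(\nu\) and \(\sigma\):

# Spearman (rank) correlation between nu and sigma draws
cor(stanRobustChains$nu, stanRobustChains$sigma, method = "spearman")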
Explore shiny object.
launch_shinystan(stanFitRobust)